Comment on "Bayesian Analysis of Pentaquark Signals from CLAS Data", with
Response to the Reply by Ireland and Protopopsecu

Robert D. Cousins
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095

(Dated: August 22, 2009)

The CLAS Collaboration has published an analysis using Bayesian model selection. My Comment
criticizing their use of arbitrary prior probability density functions, and a Reply by D.G. Ireland
and D. Protopopsecu, have now been published as well. This paper responds to the Reply and
discusses the issues in more detail, with particular emphasis on the problems of priors in Bayesian
model selection.

PACS numbers: 06.20.Dk, 07.05.Kf, 12.39.Mk, 14.80.-j

I. INTRODUCTION

The CLAS Collaboration [1], citing the pioneering work of Harold Jeffreys [2], has published a Letter [1] that claims
to illustrate a general method that "could be applied to any data set where a search for a new state has been carried
out", providing a "quantitative measure" for judging potential discovery results by using the formalism of Bayesian
model selection. My Comment [3] criticizing their use of arbitrary prior probability density functions, and a Reply
[4] by D.G. Ireland and D. Protopopsecu, have now been published as well. As I believe that the Reply did not
satisfactorily address the points raised by my Comment, and that it was not informed by other points in the articles
I cited, I elaborate in this post to the arXiv.

All Bayesian calculations, and in particular model selection results, are potentially sensitive to the choice of prior
pdf. My short Comment [3], reproduced in Sec. II, focused on the statement in the Letter [1]: "We assume that
each prior is a uniform distribution between a lower and upper limit since this represents the least initial bias." This
statement goes against the entire thrust of Jeffreys' book and subsequent research: Jeffreys explains in convincing
detail the contradictions one reaches by such use of Laplace's idea of a uniform prior. Of course, a lot has been learned
in the nearly half century since Jeffreys' last edition appeared in 1961, and my Comment included references to some
more recent key papers and discussion, notably by Jose Bernardo and Jim Berger and collaborators, and the review
by Kass and Wasserman.

My Comment focused on problems in prior specification which are present already in Bayesian parameter estimation. 
In Bayesian Model Selection, there is an additional major concern arising from computing a ratio of two integrals 
that are evaluated in parameter spaces of different dimensionality. In the CLAS Collaboration's Letter [1], one has 
a four-dimensional prior in the denominator, while the prior in the numerator has the same four dimensions plus 
three additional ones. Specification of the prior in these extra dimensions without arbitrarily affecting the answer is 
a difficult problem which is not addressed by the uniform priors described in the Letter. 

To spell out these objections further, in Sec. III I respond to some specific statements in the Reply [4]. Sec. IV
explains why the evidence ratio in the Letter has a steep dependence on the arbitrarily-chosen upper and lower limits 
referred to in the above quote. Indeed the method advocated in the Letter, in which "The prior parameter ranges 
were established by performing an initial fit and setting the limits to be ±50% of the values found" [1], is of a type 
cautioned against in the Bayesian literature. I conclude in Sec. V. 

II. MY COMMENT AS PUBLISHED IN PRL [3] 

The CLAS collaboration, in presenting a Bayesian analysis of two searches for pentaquarks [1], suggests "an al- 
ternative means of quantifying the evidence for discovery. What is specifically required is a quantitative comparison 
between the two hypotheses. . . " , and concludes, "More generally, this method could be applied to any data set where 
a search for a new state has been carried out, and can provide a quantitative measure with which to judge whether 
or not a result represents a discovery." Especially given the repeated emphasis on "quantitative", one must challenge 
the statement regarding the prior probability densities that CLAS used for the continuous parameters $\xi$: "We assume
that each prior is a uniform distribution between a lower and upper limit since this represents the least initial bias."
This assumption of uniform priors for up to seven parameters in arbitrary metrics (while also not defining "bias") is
in conflict with the large literature on priors (both subjective and non-subjective), including the influential work of 
Jeffreys [2] cited by Ref. [1] as the basis of their method. 

For a continuous parameter $\xi$ with reasonably behaved Bayesian prior probability density function $P(\xi|M)$, one
can use the probability integral transformation [5] to construct a function $\zeta(\xi)$ for which $P(\zeta|M)$ is uniform.
Thus the choice of $P(\xi|M)$ is equivalent to the choice of the metric $\zeta(\xi)$ in which $P$ is uniform; arbitrarily assuming
$\zeta(\xi) = \xi$ as implicit in Ref. [1] is without justification. In parameter estimation problems, Jeffreys advocated a
general rule for non-subjective priors using the Fisher information, although he saw the need for modification in cases
of multiple parameters. Advocates of so-called "objective Bayesianism", Bernardo [6] with Berger [7, 8], and others
then developed "reference priors" for multiple parameters. For model selection, other considerations may lead to yet
other priors [2, 9].
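
To make the dependence on the metric concrete, the following minimal Python sketch compares the prior probability assigned to the lower half of a parameter range when the prior is uniform in $\xi$ and when it is instead uniform in $\ln \xi$. The range [1, 3] and the logarithmic metric are assumptions chosen only for illustration; neither is taken from Ref. [1].

```python
import math

# Hypothetical parameter range, chosen only for illustration.
LO, HI = 1.0, 3.0


def prob_uniform_in_xi(a, b):
    """Prior probability that xi lies in [a, b] when the prior is uniform in xi."""
    return (b - a) / (HI - LO)


def prob_uniform_in_log_xi(a, b):
    """Same probability when the prior is instead uniform in zeta = ln(xi)."""
    return (math.log(b) - math.log(a)) / (math.log(HI) - math.log(LO))


# Probability assigned to the lower half of the range under each metric.
mid = 0.5 * (LO + HI)
p_lin = prob_uniform_in_xi(LO, mid)
p_log = prob_uniform_in_log_xi(LO, mid)
print(f"uniform in xi:     P(lower half) = {p_lin:.3f}")
print(f"uniform in ln(xi): P(lower half) = {p_log:.3f}")
```

Each prior is "uniform" in its own metric, yet the two assign different probabilities to the same interval (0.5 in the linear metric, $\ln 2/\ln 3 \approx 0.63$ in the logarithmic one).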

The many issues of priors selected by such "formal rules" are reviewed by Kass and Wasserman [10] and by Berger 
and Pericchi [9]. Bernardo discusses his views in Ref. [11], with commentary from statisticians. Other spirited 
discussion follows articles by Berger [12] and by Goldstein [13] who advocate, respectively, the objective and subjective 
Bayesian approaches. Efron has also noted [14], "Perhaps the most important general lesson is that the facile use of
what appear to be uninformative priors is a dangerous practice in high dimensions." 

For Bayesian analyses relevant to a community of scientists one should study the sensitivity of the result to variation 
of the priors. Goldstein [13] says, "We can then produce the range of posterior judgements, given the data, which 
correspond to the range of 'reasonable' prior judgements held within the scientific community." Bernardo [11] says, 
"Non-subjective Bayesian analysis is just a part - an important part, I believe - of a healthy sensitivity analysis to 
the prior choice. . . " Berger also argues that objective priors such as Jeffreys' priors and their generalizations lead to 
good frequentist properties which are welcomed by many statisticians although not part of the Bayesian paradigm.

In summary, Ref. [1] does not justify its assumption of uniform priors in the chosen parameter metrics $\xi$. A
"healthy" sensitivity analysis should accompany any use of this method for scientific communication, but is absent
in Ref. [1]. Following prominent objective Bayesians [11, 12], it would also be useful to understand the frequentist
properties of the method in order to facilitate comparison of results from different paradigms. 

III. RESPONSE TO THE REPLY TO MY COMMENT 

I do not believe that the Reply [4] by Ireland and Protopopsecu adequately addresses my criticisms with respect to 
the priors. There is also no study presented of the frequentist sampling properties of their result. Thus the scientific 
conclusions one obtains by following the general method as presented in the Letter [1] are without proper foundation, 
and cannot be interpreted in the "quantitative" fashion claimed in the Letter. 

Below are some italicized quotes from the published Reply and my further comments. 

"The choice of a Gaussian function to represent a peak is a standard one, since the three parameters required to 
specify it are related to quantities with physically meaningful interpretations: centroid position, detector resolution 
and signal strength. The limits on the possible value of centroid position correspond to the mass range for which the 
experiment is sensitive, so it is a location parameter for which a uniform prior is a reasonable choice."

In this (August 2009) version of this note, I modify my comment on this point. My earlier comment contained the 
following two paragraphs: 

"The fact that a parameterization is a standard representation in physics formulas has nothing to do with whether 
or not those are the metrics in which the prior should be flat. The claim that mass is a location parameter for
which a uniform prior is reasonable is false. The prototypical example that Jeffreys used to show the non-universal
applicability of uniform priors is in fact a physical parameter with a semi-infinite range of possible values: the charge
of the electron (pp. 104-105 in the 2nd edition, pp. 119-120 in the third edition). His analysis applies equally to mass.
Those pages are in the part of the book on estimation, and as noted in my Comment, one can be led to different
priors in model selection; but in neither case is mass a location parameter with uniform prior."

"The classic example of a location parameter is the mean of a Gaussian where $\theta$ is in $(-\infty, \infty)$. But taking a
semi-infinite physical parameter such as mass and smearing it with a Gaussian resolution function does not turn it 
into a location parameter." 

I now think the issue is not as clear-cut as I portrayed it, for very interesting reasons. The above argument by 
Jeffreys is an argument based on considerations of the physical parameter itself. This is in contrast to the argument 
(in the same book) leading to Jeffreys's Rule for priors that are known as "Jeffreys priors", and that are generalized to
the Reference Priors of Bernardo and collaborators. Jeffreys's Rule for priors is completely based on the measurement
process, i.e., the specific experimental setup one uses in order to "make the measurement"; this is what statisticians
typically call "the model". This is distinct from considerations (such as the scaling argument for the electron charge)
about the physical parameter itself! That is why, for instance, the Jeffreys prior for the binomial parameter is different
in the binomial model than in the negative binomial model, in violation of the strong likelihood principle, even if it is the
same physical parameter. So when using Jeffreys's Rule, it is indeed the case that simply measuring a parameter with
a Gaussian resolution function implies a uniform prior, independent of one's notions about the physical parameter
itself!
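
As a reminder of the standard textbook result alluded to here (it is not spelled out in the text above), Jeffreys's Rule takes $\pi(\theta) \propto \sqrt{I(\theta)}$, with $I(\theta)$ the Fisher information of the sampling model. For a success probability $\theta$, with $n$ and $r$ denoting the fixed number of trials and the fixed number of successes respectively, this gives:

```latex
\begin{align*}
\text{binomial ($n$ trials fixed):} \quad
  & I(\theta) = \frac{n}{\theta(1-\theta)},
  & \pi(\theta) &\propto \theta^{-1/2}(1-\theta)^{-1/2}, \\
\text{negative binomial ($r$ successes fixed):} \quad
  & I(\theta) = \frac{r}{\theta^{2}(1-\theta)},
  & \pi(\theta) &\propto \theta^{-1}(1-\theta)^{-1/2}.
\end{align*}
```

The two likelihoods are proportional to each other as functions of $\theta$, yet the priors differ because the stopping rule enters through the Fisher information.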

As to the issue of a restriction on allowed values of the parameter, Berger [17] addresses exactly this point: 

"One important feature of the Jeffreys noninformative prior is that it is not affected by a restriction on the 
parameter space. Thus, if it is known in Example 5 [in which $\theta$ is a location parameter] that $\theta > 0$, the Jeffreys
noninformative prior is still $\pi(\theta) = 1$ (on $(0, \infty)$, of course). This is important, because one of the situations in which
noninformative priors prove to be extremely useful is when dealing with restricted parameter spaces (see Chapter 4). 
In such situations we will, therefore, simply assume that the uninformative prior is that which is inherited from the 
unrestricted parameter space." 

My above-quoted comments on this point were thus wrong from the point of view of objective Bayesianism; the 
scaling argument of Jeffreys which I quoted is eclipsed by Jeffreys's Rule. On the one hand this might seem to be a 
compelling argument against such "objective" priors from Jeffreys's Rule, but on the other hand it is worth recalling 
arguments in favor of such objective priors, namely that they "let the data speak the loudest", and that since the 
time of Welch and Peers they are associated with probability matching to a certain order. 

This discussion seems to make quite clear how so-called objective priors are different from subjective priors. 
Subjective priors encode an individual's prior beliefs about the parameter, while objective priors encode properties 
of the measuring apparatus (including the stopping rule, in violation of the likelihood principle). The arguments I 
quoted advocating a sensitivity analysis would seem to take on even greater force in these circumstances!

"The width and height of the Gaussian are, strictly speaking, scale parameters for which a Jeffreys' prior may
be appropriate. However, for the calculation of evidence integrals, we require normalized priors, so a Jeffreys'
prior [$f(x) \propto 1/x$] needs to be normalized between two limits. Limits on the width of the peak are naturally suggested:
the minimum by detector resolution and a maximum such that a peak is not confused with a background shape."

The uniform prior is also unnormalizable over all space. The authors do not explain why they chose a truncated 
uniform prior rather than a truncated version of some other unnormalizable prior, if that was the concern. But the 
larger issue is that if one is in a situation where truncation is needed in order to make the prior normalizable, then 
the result of the model selection calculation will depend on the endpoints used for the truncation. I come back to 
this crucial point in Sec. IV below.

"For the signal parameter, we must include zero, since the posterior probability density function shows that,
even in cases where the maximum likelihood is non-zero (i.e. a peak is most likely), there is still a significant
probability that the results are consistent with zero signal. The Jeffreys' prior is undefined at $x = 0$, so an alternative
could be to use a gamma distribution, which is normalized, has a similar decay as $x \to \infty$, and is defined as $x \to 0$."

The Bayesian literature on model selection (starting with Jeffreys) has a lot of discussion about the situation, as 
we have here, where the crucial scientific question is whether or not a particular parameter is zero (i.e., corresponding 
to the classical case of a point null hypothesis). There are various ways to approach this, for example concentrating 
some prior in a delta function at the point, and spreading out the rest. (This is already in Jeffreys' 2nd edition.) 
A more comprehensive look at the professional literature is needed to understand the subtleties which have been 
ignored here. 

"We studied the problem with these alternative priors, and found no significant difference in the results obtained; the 
use of uniform priors was thus motivated by using the simplest form appropriate to the problem."

Without a quantitative exposition or knowing the scope of the sensitivity analysis, one cannot evaluate the claim 
that their method provides a "quantitative" result. Were the limits of the mass range changed? What range of 
shapes of gamma functions was explored? 

"In addition, the sensitivity of the results to the parameters of the background is minimal. The evidence ratios (or 
Bayes Factors) are in the form of (logarithms of) ratios of integrals. Any slowly varying prior in the space of the 
background parameters will thus result in approximately constant factors that cancel."



Again, whether something is "minimal" or "approximately constant" is in the eye of the beholder, so the authors 
should provide the list of alternatives explored and the numerical results. 

"In the calculations, the likelihood functions for each data model must be integrated over all parameter space. The 
effect of priors is to weight the likelihood functions. In practice, the likelihood functions for the study in the letter are 
significantly non-zero in only a small region of parameter space. It would thus take a rapidly varying prior in this region
to make a noticeable difference to the integrals, and one would of course have to justify this rapidly varying prior."

A rapidly varying prior in their chosen metric will be a uniform prior in another metric; what is the criterion for 
choosing the preferred metric? As I discuss below, the location of the endpoints of the uniform prior can affect the 
answer, even if the likelihood is zero near the endpoints. 

With seven parameters, the authors may be surprised to find that the volume effect (most of the prior probability 
located near the boundary of their 7-D hypercube) distorts the posteriors. 
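
The volume effect can be illustrated with a short calculation. The sketch below assumes a uniform prior on a unit hypercube and asks what fraction of the prior probability lies within a given margin of any face. The 10% margin is an arbitrary choice for illustration, not a number taken from the Letter.

```python
def boundary_fraction(ndim, margin):
    """Fraction of a unit hypercube's volume within `margin` of any face.

    `margin` is expressed as a fraction of the side length, so the inner
    cube that stays clear of every face has side 1 - 2 * margin.
    """
    inner_side = 1.0 - 2.0 * margin
    return 1.0 - inner_side ** ndim


# Compare one, four and seven dimensions for the same (assumed) 10% margin.
for ndim in (1, 4, 7):
    print(f"{ndim}-D: {boundary_fraction(ndim, 0.10):.3f}")
```

Under these assumptions the fraction rises from 20% in one dimension to about 59% in four and about 79% in seven.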

"We used the term "least bias" to indicate that, as far as possible, we wanted to see what information one could
extract from the data, whilst introducing only minimal prior prejudice."

As shown by Jeffreys [2] and reviewed by Kass and Wasserman [10], using uniform priors in the manner of the
Letter [1] does not satisfy this desire. This is also extensively discussed by Bernardo in the "dialogue" cited in my 
Comment [11]. 

"How one achieves this is an open question, and it should be noted that the debate within the statistics community
appears far from settled, as attested by the papers cited in the Comment [2,3], and subsequent contributions in the
same publication."

This was the reason I wrote the Comment: The Letter's statement that their uniform priors had the "least bias" 
was without foundation. It appears that the authors of the Reply now agree. 

"To conclude, the use of alternative priors makes little difference to the results of our study. The measured data in 
this case therefore contain sufficient information to dominate the calculation of the probabilistic quantities of interest."

The reader should be provided with details of the alternative priors and numerical results in order to judge what 
"little difference" means, since Sec. IV below indicates that this is not the case. 

"Actual claims of discovery would require a more detailed examination of evidence than presented in the Letter. We
therefore fully agree with the author of the Comment that, in general, a sensitivity analysis of results to the choice of
priors (and data models) is essential."

The main scientific point of the Letter [1] was to claim that the first CLAS result was actually "inconclusive" (and 
if anything weak evidence against a peak), contradicting CLAS's earlier published claim of $5.2\sigma$ "observation" of a
peak. It would seem that such an extraordinary situation should be backed up with the sort of "detailed examination" 
that the Reply authors agree is required for a discovery. This should include the dependence on the limits (endpoints) 
of the prior as discussed in Sec. IV below. 

IV. THE PROBLEM OF DIFFERING DIMENSIONS IN MODEL SELECTION AND THE DANGERS OF ARBITRARILY TRUNCATING THE PRIOR

A difficulty in the present model selection calculation is a common one in Bayesian analysis: Model A has four 
parameters, and Model B has the same four parameters, plus three more. If the priors for the three additional 
parameters are unnormalizable (e.g. a uniform prior extending to $\infty$), then the answer is completely arbitrary:
multiplying the prior by a constant in the numerator but not the denominator (where it does not appear) will change
the evidence ratio. The Reply alludes to this, and in the Letter the uniform prior was truncated at a location well
outside the range where the likelihood function is non-negligible. This truncation is perhaps plausible-sounding since such
a truncation may be made without severe effect in estimation problems. However, the effect in model selection is 
unfortunately that which often occurs in physics when dealing with an infinity by introducing a cutoff: one simply 
replaces the infinity with a dependence on the cutoff, which is arbitrary. 



The Letter set the endpoints to be ±50% of the best-fit values. ("The prior parameter ranges were established
by performing an initial fit and setting the limits to be ±50% of the values found" [1].) The Reply [4] appears to
contradict this statement from the Letter, by saying that zero was included in one interval; but as this statement
in the Letter has not been explicitly retracted, we can take it as the generally applicable method advocated by the
CLAS Collaboration. Then let $f$ denote this percentage, i.e. $f = 0.5$ in the Letter, and let $\hat{\xi}$ denote the best-fit value
of a parameter $\xi$. Then the prior is:

$$
P(\xi|M) =
\begin{cases}
1/(2 f \hat{\xi}), & \hat{\xi} - f\hat{\xi} < \xi < \hat{\xi} + f\hat{\xi}, \\
0, & \text{otherwise.}
\end{cases}
\qquad (1)
$$

Thus, even if the likelihood functions are all negligible near the endpoints of the uniform prior, after all integrations
are performed the arbitrary factor $1/(f\hat{\xi})$ will appear in the numerator of the evidence ratio, and not be canceled by
a factor in the denominator. Thus the arbitrary constant of an unnormalizable uniform prior is just replaced by an
arbitrary constant determining the height of the normalized uniform prior (via the arbitrary specification of the width
and the normalization condition).
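
A toy calculation illustrates the point. The sketch below assumes a single extra parameter whose likelihood is Gaussian-shaped around the best-fit value, with hypothetical numbers (best-fit value 10, likelihood width 0.5) that are not taken from the CLAS data, and integrates likelihood times prior for the uniform prior of Eq. (1).

```python
import math

S_HAT = 10.0  # hypothetical best-fit value of the extra parameter
SIGMA = 0.5   # hypothetical width of the Gaussian-shaped likelihood


def marginal_likelihood(f):
    """Integral of likelihood * prior for a uniform prior on S_HAT * (1 -/+ f).

    The likelihood is exp(-(s - S_HAT)**2 / (2 * SIGMA**2)), i.e. scaled to 1
    at its peak, so its integral over the prior range has a closed form in
    terms of the error function.
    """
    half_width = f * S_HAT
    prior_height = 1.0 / (2.0 * half_width)
    likelihood_integral = (
        SIGMA * math.sqrt(2.0 * math.pi)
        * math.erf(half_width / (SIGMA * math.sqrt(2.0)))
    )
    return prior_height * likelihood_integral


for f in (0.25, 0.50, 0.75):
    print(f"f = {f:.2f}: marginal likelihood = {marginal_likelihood(f):.5f}")
```

Even at $f = 0.25$ the prior endpoints sit five likelihood widths from the peak, so the likelihood is negligible there, yet the marginal likelihood still scales as $1/f$.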

As the Letter has one such factor $f$ for each of its three extra parameters in the numerator, the evidence ratio goes
as $f^{-3}$. That is, if $f$ is varied from 25% to 75%, the evidence ratio changes by a factor of 27, and its logarithm, $\ln(R_E)$,
changes by 3.3 units. This is enough to change the evidence by two categories of strength in Jeffreys' scale!
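
The arithmetic can be checked directly. The sketch below assumes, as argued above, that each of the three extra parameters contributes a factor proportional to $1/f$ to the evidence ratio, and that the same $f$ is used for all three.

```python
import math

N_EXTRA = 3               # extra parameters in the numerator (from the Letter)
f_lo, f_hi = 0.25, 0.75   # range over which f is varied

# Each extra parameter contributes a factor proportional to 1/f,
# so the evidence ratio changes by (f_hi / f_lo) ** N_EXTRA.
factor = (f_hi / f_lo) ** N_EXTRA
shift_in_log = math.log(factor)

print(f"evidence ratio changes by a factor of {factor:.0f}")
print(f"its natural logarithm changes by {shift_in_log:.2f} units")
```

If the three values of $f$ were instead varied independently, as recommended in the literature quoted below, the individual factors would simply multiply.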

The value of $f$ used for the mass parameter deserves special attention, since it controls the "Occam's razor" effect
due to where one is looking in mass: a firm prediction of the mass of the pentaquark, followed by a peak at that
location, should result in an enhanced evidence ratio resulting from a small value of $f$. It is not clear to what extent the
analysis in the Letter in fact reduced the evidence ratio derived from the first data set by using an inflated value of $f$.

Given such strong dependence on an arbitrary parameter $f$, it is hard to comprehend the claim in the Reply that
a sensitivity analysis was performed with "no significant difference in the results obtained".

(The Letter was not clear as to whether or not the limits on the four parameters in the 3rd-order polynomial were 
the same in the numerator as in the denominator. If different "values found" by the fit led to different limits in the 
numerator and denominator, then this adds to the non-canceling dependence on $f$.)

This issue is of course known in the Bayesian literature that I cited in my Comment. For example, Berger and
Pericchi [9] list "Difficulty 3. Use of vague proper priors usually give bad answers in Bayesian model selection", with
a specific example where the Bayes factor depends on the arbitrarily chosen "large" value of K, which in their example
sets the size of the region over which the prior is appreciable. They conclude, "The short story is never use vague
priors for model selection..." (and in fact prefer an improper prior in that particular example). The limited-length
uniform priors used in the Letter suffer from the same disease of the Bayes factor depending on the arbitrary $f$.

In his article in Bayesian Analysis cited by my Comment, Jim Berger [12] has a section entitled "Dangers of
Casual Objective Bayesian Analysis" which is even more explicit in urging caution in use of "pseudo-Bayes" analyses:
"...while they utilize Bayesian machinery, they do not carry with them any of the guarantees of good performance that
come with either true subjective analysis (with a very extensive elicitation effort) or (well-studied) objective Bayesian
analysis... and hence must be validated by some other route." He specifically warns in a section entitled "Truncation
of the parameter space" that truncation at large ±K to avoid having an improper pdf must be done with care: "At
the very least, this approach should only be used if a very careful sensitivity study is done with respect to these
bounds (and with bounds for different parameters varying independently in the sensitivity study)." The context of
these quotes is in terms of avoiding an improper posterior by truncating an improper prior, but the concern is exactly 
paralleled in truncation of the prior in the model selection problem of the Letter [1]. 

The last subsection in Berger's section on "Dangers of Casual Objective Analysis" is worth quoting extensively [12]:

"Data-dependent vague proper priors. The second common data-dependent procedure is to choose priors that 
span the range of the likelihood function. For instance, one might choose a uniform prior over a range that includes 
most of the mass of the likelihood function, but that does not extend too far (thus hopefully avoiding the problem 
of using a 'too vague' proper prior). Another version of this procedure is to use conjugate priors, with parameters 
chosen so that the prior is spread out somewhat more than the likelihood function, but is roughly centered in the 
same region. The two obvious concerns with these strategies are that (i) the answer can still be quite sensitive to the 
spread of the rather arbitrarily chosen prior; and (ii) centering the prior on the likelihood is a quite problematic double 
use of the data. Also, in problems with complicated likelihoods, it can be very difficult to implement this strategy 
successfully. ... In conclusion, while these pseudo-Bayesian techniques can be useful as data exploration tools, they 
should not be confused with formal objective Bayesian analysis, which has very considerable extrinsic justification as 
a method of analysis." 



V. CONCLUSION 

If we are going to use Bayesian techniques in our research, then we should read and understand a representative 
sampling of the relevant Bayesian literature. I urge anyone contemplating an objective Bayesian analysis to read 
Kass and Wasserman [10] before attempting to write down a so-called objective or noninformative prior in a desire 
to "represent the least initial bias". If one wants to go beyond parameter/interval estimation and get into model
selection, the article by Berger and Pericchi [9], also cited in my Comment, is a must for beginning to appreciate the 
difficulty of the subject and potential pitfalls; the volume has other valuable articles and commentary as well. For 
another perspective and pointers to a much broader discussion, see Chapter 6 (including the "Bibliographic Note" 
in Sec. 6.9) of the text by Gelman et al. [16]. Kass and Raftery give another brief synopsis in Sec. 5.1 of their
article on Bayes Factors [15]. Of course Jeffreys' classic monograph also still provides insightful reading and historical 
perspective. 

The articles by Berger and by Goldstein and the ensuing discussion in Bayesian Analysis [12, 13] are a great 
introduction to the discussion within the Bayesian community. While there is quite a spirited discussion, it is clear 
that there is a consensus recommendation for a "healthy" sensitivity analysis in any Bayesian analysis used for 
scientific communication. In a Model Selection analysis, particular caution is needed when using priors in which 
arbitrary constants in the normalization do not cancel in the evidence ratio. The method advocated by the CLAS 
Collaboration [1], while applying the established Bayesian model selection formalism, used such arbitrary inputs and 
thus the "quantitative" output should also be regarded as arbitrary, until a "healthy" sensitivity analysis is displayed, 
and/or the sampling properties are understood. 